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Iteratively Reweighted ii Approaches to Sparse 
Composite Regularization 

Rizwan Ahmad and Philip Schniter, Fellow, IEEE 


Abstract —Motivated by the observation that a given signal x 
admits sparse representations in multiple dictionaries At d but 
with varying levels of sparsity across dictionaries, we propose 
two new algorithms for the reconstruction of (approximately) 
sparse signals from noisy linear measurements. Our first algo¬ 
rithm, Co-Ll, extends the well-known lasso algorithm from the 
LI regularizer ||^'a3||i to composite regularlzers of the form 
Ad||^tja:||i while self-adjusting the regularization weights 
Ad. Our second algorithm, Co-IRW-Ll, extends the well-known 
iteratively re weigh ted LI algorithm to the same family of 
composite regularlzers. We provide several Interpretations of 
both algorithms: i) majorization-minimization (MM) applied 
to a non-convex log-sum-type penalty, ii) MM applied to an 
approximate ^o-type penalty, iil) MM applied to Bayesian MAP 
inference under a particular hierarchical prior, and Iv) varia¬ 
tional expectation-maximization (VEM) under a particular prior 
with deterministic unknown parameters. A detailed numerical 
study suggests that our proposed algorithms yield significantly 
improved recovery SNR when compared to their non-composite 
LI and IRW-Ll counterparts. 

I. Introduction 

We consider the problem of recovering the signal (or image) 
X € from noisy linear measurements of the form 

y = $a; + ut€C^, (1) 

where $ G i^mxn ^ known measurement operator and 
w G is additive noise. Such problems arise in imaging, 

machine learning, radar, communications, speech, and many 
other applications. We are particularly interested in the case 
that M N, where x cannot be uniquely determined from 
the measurements y, even in the absence of noise. This latter 
situation arises in many of the aforementioned applications, as 
well as in broad area of signal recovery methods associated 
with compressive sensing (CS) H]. 

A. Regularized (. 2 -Minimization 

By incorporating (partial) prior knowledge about the signal 
and noise power, it may be possible to accurately recover x 
from M N measurements y. In this work, we consider 
signal recovery based on optimization problems of the form 

argmin 7 ||y — $at ||2 + i?(a;) (2) 

X 
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where 7 is a tuning parameter that reflects knowledge of 
the noise level and R{x) is a penalty, or regularization, that 
reflects prior knowledge about the signal x 10 . We briefly 
summarize several common instances of R{x) below. 

1) If X is known to be sparse (i.e., contains sufficiently few 

non-zero coefficients) or approximately sparse, then one 
would ideally like to use the (q penalty (i.e., counting 
“norm”) R{x) = ||at||o = lk„|>o, where Ipj is 

the indicator function. However, since this choice makes 
(lU NP-hard, it is not often used in practice. 

2) The (i penalty, R{x) = ||a;||i = J2n=i \^n\, is a well- 
known surrogate to the (q penalty that renders (| 2 ]i convex, 
and thus amenable to polynomial-time solution. In this 
case, © is known as the basis pursuit denoising 0 or 
lasso a problem, which is commonly used in synthesis- 
based CS 0 . 

3) Various non-convex surrogates for the (q penalty have 

also been considered, such as the (p penalty R{x) = 
ll^llp = J2n=i with p G (0,1) and the log-sum 
penalty R{x) = + l^"l) e > 0. Al¬ 

though © becomes difficult to solve, it can be tractably 
approximated. See 0 for a more complete discussion. 

4) The choice R{x) = ||\Pa;||i, with known matrix G 
(j^LxN, to analysis-based CS Q and the general¬ 
ized lasso 0 . Penalties of this form are appropriate when 
prior knowledge suggests that the transform coefficients 
fl/tc are (approximately) sparse, as opposed to the signal 
X itself being sparse. When is a finite-difference 
operator, ||’®'a;||i yields anisotropic total variation reg¬ 
ularization O. 

5) Non-convex penalties can also be placed on the transform 

coefficients Atx, leading to, e.g., R{x) = ||\Pa;||P = 
El=i with p G (0,1) or R{x) = log(e -f 

\xpjx\) with e > 0 . 

A popular approach to solve © with a non-convex penalty 
R{x) is through iteratively reweighted (i (IRW-Llj^ ©■ 
There, © with fixed non-convex R{x) is approximated by 
solving a sequence of convex problems 

= argmin 7 ||y — ^•a ;||2 + R^*\x), (3) 

X 

where, at iteration t, the penalty R^*\x) = 
with each weight Wn'^ set based on the previous estimate 
Constrained formulations of IRW-Ll based on “x^*^ = 
argmina; s.t. \\y — $tc|l 2 < S,” have also been con¬ 

sidered, such as in Ha-in. Many of the papers cited above 

^Iteratively reweighted £2 is a populai' alternative, e.g., m tni 
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show empirical results where the performance of IRW-Ll 
surpasses that of standard 

B. Sparsity-Inducing Composite Regularizers 

In this work, we focus on sparsity-inducing composite 
regularizers of the form 

D 

<(a;;A)4^Ad||«'da;||i, (4) 

d=l 

where each G i^LdxN ^ known analysis operator and 
Ad > 0 is a corresponding regularization weight. Our goal is 
to recover the signal x from measurements ([T]i by optimizing 
(l2]i with the composite regularizer (IHi. Doing so requires an 
optimization of the weights A = [Ai,... , A^]^ in (|4]i. We 
are also interested in iteratively re-weighted extensions of this 
problem that, at iteration t, use composite regularizers of the 
foriiJl 

D 

d^l 

where are diagonal matrices. This latter approach re¬ 
quires the optimization of both and for all d. 

As a motivating example, suppose that {^d} is a collection 
of orthonormal bases that includes, e.g., spikes, sines, and 
various wavelet bases. The signal x may be sparse in some 
of these bases, but not all. Thus, we would like to adjust each 
Ad in (IDi to appropriately weight the contribution from each 
basis. But it is not clear how to do this, especially since x 
is unknown. As another example, suppose that x contains a 
(rasterized) sequence of images and that ||’S'ia;||i measures 
temporal total-variation while ||\Ir 2 a:||i measures spatial total- 
variation. Intuitively, we would like to weight these two 
regularizations differently, depending on whether the image 
pixels vary more in the temporal or spatial dimensions. But it 
is not clear how to do this, especially since x is unknown. 

C. Contributions 

In this work, we propose novel iteratively reweighted ap¬ 
proaches to sparse reconstruction based on composite regu¬ 
larizations of the form (llli-Q with automatic tuning of the 
regularization weights A and Wd- For each of our proposed 
algorithms, we will provide four interpretations: 

1) MM applied to a non-convex log-sum-type penalty, 

2) MM applied to an approximate fp-type penalty, 

3) MM applied to Bayesian MAP inference based on 
Gamma and Jeffrey’s hyperpriors II15I16I . and 

4) variational expectation maximization (VEM) II17I18I ap¬ 
plied to a Laplacian or generalized-Pareto prior with 
deterministic unknown parameters. 

We show that the MM interpretation guarantees convergence 
in the sense of satisfying an asymptotic stationary point con¬ 
dition US). Moreover, we establish connections between our 
proposed approaches and existing IRW-Ll algorithms, and we 

^Although is over-parameterized, the form of is convenient for 
algorithm development. 


provide novel VEM-based and Bayesian MAP interpretations 
of those existing algorithms. 

Einally, through the detailed numerical study in Sec. lIVI we 
establish that our proposed algorithms yield significant gains 
in recovery accuracy relative to existing methods with only 
modest increases in runtime. In particular, when {^d} are 
chosen so that the sparsity of varies with d, this structure 
can be exploited for improved recovery. The more disparate 
the sparsity, the greater the improvement. 

D. Related Work 

As discussed above, the generalized lasso 0 is one of 
the most common approaches to LI-regularized analysis-CS 
a, i.e., the optimization (|2]i under the regularizer R{x) = 
||\I/a;||i. The Co-Ll algorithm that we present in Sec. |II] 
can be interpreted as a generalization of this LI method 
to composite regularizers of the form (|4]i. Meanwhile, the 
iteratively reweighted extension of the generalized lasso, IRW- 
Ll a, often yields significantly better reconstruction accuracy 
with a modest increase in complexity (e.g., iniH). The 
Co-IRW-Ll algorithm that we present in Sec. m can be 
interpreted as a generalization of this IRW-Ll method to 
composite regularizers of the form ©. The existing non¬ 
composite LI and IRW-Ll approaches essentially place an 
identical weight Ad = 1 on every term in (|4]i-(|5]l, and thus 
make no attempt to leverage differences in the sparsity of the 
transform coefficients '^dX across the sub-dictionary index d. 
However, the numerical results that we present in Sec. m 
suggest that there can be significant advantages to optimizing 
Ad, which is precisely what our methods do. 

The problem of optimizing the weights Ad of composite reg¬ 
ularizers R{x;X) = J2d^d^d-{x) is a long-standing problem 
with a rich literature (see, e.g., the recent book lISOl l. However, 
the vast majority of that literature focuses on the Tikhonov 
case where Rd{x) are quadratic (see, e.g., 11211 - 11241 ). One 
notable exception is ED, which assumes continuously dif¬ 
ferentiable Rdix) and thus does not cover our composite 
prior (Ell. Another notable exception is ll2^ . which assumes i) 
the availability of a noiseless training example of x to help 
tune the LI regularization weights A in (Ell, and ii) the trivial 
measurement matrix $ = J. In contrast, our proposed methods 
operate without any training and support generic measurement 
matrices $. 

In the special case that each (Ld is composed of a subset of 
rows from the N x N identity matrix, the regularizers (l4|i-(0 
can induce group sparsity in the recovery of x, in that certain 
sub-vectors Xd — ^dX of x are driven to zero while others 
are not. The paper ll27l develops an IRW-Ll-based approach to 
group-sparse signal recovery for equal-sized non-overlapping 
groups that can be considered as a special case of the Co-Ll 
algorithm that we develop in Sec. HI] However, our approach 
is more general in that it handles possibly non-equal and/or 
overlapping groups, not to mention sparsity in a generic set of 
sub-dictionaries ’®'d. Recently, Bayesian MAP group-sparse 
recovery was considered in Il28l . However, the technique 
described there uses Gaussian scale mixtures or, equivalently, 
weighted-L2 regularizers R{x;\) = X^d while our 

methods use weighted-^i regularizers (l4|i-(l5]l. 
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E. Notation 

We use boldface capital letters like for matrices, boldface 
small letters like x for vectors, and (•)^ for transposition. We 
use ||a:||p = for the ip norm of x, with Xn 

representing the coefficient in x and p > 0. We then 

use ||£c||o = limp^oX^n \^n\^ ifT^ when referring to the (.q 
quasi-norm, which counts the number of nonzero coefficients 
in X. We define the “mixed quasi-norm” with p > 0 a^ 
limq^oX]d(Sj and the “mixed fo,o quasi-norm” as 

limp,,^oSd(S/ \xd,i\^)‘^- We use Vp(£c) for the gradient of 
a functional g{x) with respect to x, and 1a for the indicator 
function that returns the value 1 when A is true and 0 when 
A is false. We use p(x; A) for the pdf of random vector x 
under deterministic parameters A, and p(£c|A) for the pdf of 
X conditioned on the random vector A. We use DY,\_{q\\p) to 
denote the Kullback-Leibler (KL) divergence of pdf p from 
pdf q, and we use R and C to denote the real and complex 
fields, respectively. 

II. The Co-Ll Algorithm 

We first propose the Composite-Ll (Co-Ll) algorithm, 
which is summarized in Algorithm [T] There, Ld denotes the 
number of rows in 


Algorithm 1 The Co-Ll Algorithm 

1 : input: {^d}d=i’ y, 7 > 0, e > 0 
2: if Atdx G use Cd = 1; if Atdx G use Cd = 2. 
3: initiaiization; = 1 Vd 
4: for f = 1,2,3,... 

D 

5: ^ argmin 7 ||y - ^x\\l + ^ 

X j . 


6 ^ 

7: end 
8: output: 


CdLd 


: -f ||^d£c(*)||i 


d= l,...,iA 


The main computational step of Co-Ll is the L2 h-L 1 min¬ 
imization in line |5] which can be recognized as (| 2 ]i under 
the composite regularizer from ( 0 . This is a convex 
optimization problem that can be readily solved by existing 
techniques (e.g., ADMM 0301311 . Douglas-Rachford splitting 
(I32l, MFISTA ||33], NESTA-UP iH, GAMP [351, etc.), the 
specific choice of which is immaterial to this paper. 

Note that Co-Ll requires the user to set a small regulariza¬ 
tion term e > 0 whose role is to prevent the denominator in 
line | 6 ] from reaching zero. For typical choices of iPd and 7 , 
the vector A>dX^*''> will almost never be exactly zero, in which 
case it suffices to set e = 0. Also, Co-Ll requires the user 
to set the measurement fidelity weight 7 . With additive white 
Gaussian noise (AWGN) of variance > 0, the Bayesian 
MAP interpretation discussed in Sec. III-DI suggests setting 
7 = 2 ^ for real-valued AWGN or 7 = ^ for circular 

^Our ipfi and £ 0,0 definitions are motivated by the standard Ip^q mixed 
norm definition (for p, 5 > 0), which is C^dC^l l^d.i 1291 . 


complex-valued AWGN. These are, in fact, the settings that 
we used for all numerical results in Sec. m 

Note line |5] of Algorithm [T] can be equivalently restated as 

D 

^ argmin^ A^‘^||^d®||i s.t. ||y-^a;|| 2 <A ( 6 ) 

d=i 

By equivalent, we mean that, for any 5 > 0, there exists a 7 
for which the solutions of line |3 and (| 6 ]l are identical 1^ . A 
version of this manuscript that focuses on the constrained case 
can be found at ET]. Numerical experiments therein show that 
the performance of Co-Ll using ® with the hand-tuned value 
(5 = O.SVMct^ is very similar to that of Algorithm [T] with 7 
chosen as described above. 

Co-Ll’s update of the weights A, defined by line | 6 ] of 
Algorithm [T] can be interpreted in various ways, as we detail 
below. For ease of explanation, we first consider the case 
where "ifdX is real-valued Vd, and later discuss the complex¬ 
valued case in Sec. HEB 

Theorem 1 (Co-Ll). The Co-Ll algorithm in Algorithm\J}has 
the following interpretations: 

1) MM applied to dJ]) under the log-sum penalty 

D 

= '^Ld\og{e+\\^dx\\i), ( 7 ) 

d^l 

2) as e ^ 0, MM applied to m under the weighted 
i l29l/ penalty 

D 

Ld li|^>ja!||i> 0 ) ( 8 ) 

d=l 

3) MM applied to Bayesian MAP estimation under an 
additive white Gaussian noise (AWGN) likelihood and 
the hierarchical prior 

p(a;|A) = n(Y) exp (-Ad||«'d£c||i) (9) 

d^l ^ ^ 

A ~/./.y. r(o,e-i) (10) 

where Zd = AfdX G R^"^ is i.i.d. Laplacian given Ad, 

and Ad is Gamma distributed with scale parameter 
and shape parameter zero, which becomes Jeffrey’s non- 
informative hyperprior p{Xd) oc lAd>o/Ad when e = 0. 

4) variational EM under an AWGN likelihood and the prior 

p{x-,\) oc f Y j exp(-Ad(||^da;||i +e)), 

d=l ^ ' 

( 11 ) 

which, when e = 0, is i.i.d. Laplacian on Zd = A/dX G 
R^*^ with deterministic scale parameter Ad > 0 . 

Proof: See Sections IlI-AI to IlI-EI below. ■ 

Importantly, the MM interpretation implies convergence (in 
the sense of an asymptotic stationary point condition) when 
e > 0, as detailed in Sec. ims 



















4 


A. Log-Sum MM Interpretation of Co-Ll 


or equivalently 


Consider the optimization problem 

argmin 7 ||y - (a;;e) (12) 

X 

with from (I7]i. Inspired by ifTJl §2.3], we write (fTST i as 
argmin 7 ||y- ^x \\2 + log ( e + Ud,i 

d=l ^ Z =1 

s.t. < Ud.z Vd,/, (13) 

where ; is the hh row of Problem (fOT l is of the form 
argmin 5 (u) s.t. v G C, (14) 

V 

where v = [vJ ^ C is a convex set, 

D 

g{v) =7||y- [0 $]u|| 2 + ^idlog 

d=l '' k^Kd 

is a non-convex penalty, and the set K.d = {fc : < 

k < Li} contains the indices k such that Vk € {ud,i}]}fi- 

Since g{v) is the sum of convex and concave terms, i.e., a 
“difference of convex” (DC) functions, (fT4l i can be recognized 
as a DC program 13^ . Majorization-minimization (MM) 1(19] 
1 ^ is a popular method to attack non-convex problems of this 
form. In particular, MM iterates the following two steps; (i) 
construct a surrogate g{v\ that majorizes g{v) at and 
(ii) update = argmin^gc gi^] By “majorize,” we 

mean that g{v]V^L^ > for all v with equality when 

V = uW. 

Due to the DC form of g{v) in (fTSl l, a majorizing surrogate 
can be constructed by linearizing the concave term about its 
tangent at v'L). In particular, say g{v) = gi(v)-\-g 2 {v), where 
gi is the convex (quadratic) term and g 2 is the concave (log- 
sum) term, and say Vp 2 is the gradient of p 2 w.r.t. v. Then 


E 


Vk 


(15) 


D 


jpCt-i-i) _ argmin 


in 7 ||y- $a;||^ + V 

n • * 


LdT,iA \'^d,ix\ 

( 20 ) 


D 


= argmin 7 ||y- $a ;||2 + ^ (21) 


d=l 


for 


A 


(t-Hi) 


Ld 


( 22 ) 


e-f ’ 

which coincides with Algorithm [T] This establishes Part [T] of 
Theorem [1] 


B. Convergence of Co-Ll 

The paper ifTOll studies the convergence of MM, and includes 
a special discussion of the application of MM to DC program¬ 
ming. In the language of our Sec. III-AI lfT9l establishes that, 
when p 2 is differentiable with a Lipschitz continuous gradient, 
the MM sequence satisfies an asymptotic stationary 

point (ASP) condition. Although this falls short of establishing 
convergence to a local minimum (which is difficult for generic 
non-convex problems), the ASP condition is based on a clas¬ 
sical necessary condition for a local minimum. In particular, 
using S/g{v; d) to denote the directional derivative of p at u 
in the direction d, it is known BOl that u* locally minimizes 
g over C only if v — vf) > 0 for all v £ C. Thus, in 

lfT9l . it is said that satisfies an ASC condition if 


liminf inf 

t—¥-\-00 V^C 


yg{v ^*'>; V — 

||u - u (*)||2 


> 0 . 


(23) 


In our case, p 2 from ([I5]l is indeed differentiable, with 
gradient S/g 2 given by (fTsT l. Moreover, Appendix shows 
that this gradient is Lipschitz continuous when e > 0. Thus, 
the sequence of estimates produced by Algorithm [T] satisfies 
the ASP condition (l23t . 


g{v, A P 2 ('W^*^) + ^g 2 iv^*^y[v - (16) 

majorizes g{v) at and so the MM iterations become 

.y(i-i-i) _ argmin5i(u) -f Vg2{v^*'')^v (17) 

v^C 

after neglecting the u-invariant terms. 

Examining the log-sum term in (fTSl l. we see that 


L 


d{k) 


[Vp2(l’^‘^)]fc = S e + Si 


Gx: 


if d{k) f- 0 


d{k) 


0 


else, 


(18) 


where d(fc) is the index d G {1,D} of the set ICd containing 
k, or 0 if no such set exists. Thus MM prescribes 


D 




= argmin 7 ||y — [0 


LdVk 


d—1 kGK-d 




Jt)-' 

Zd 

( 19 ) 


C. Approximate iig Interpretation of Co-Ll 
In the limit of e —0, the log-sum minimization 

N 

argmin 7 ||y - 4>a;||2 + V log(e -f |a;„|) (24) 

X 

n—l 

for 7 > 0 is known ca to be equivalent to Iq minimization 
argmm7'||y - $a;|| 2 -f ||a;||o (25) 

X 

for some 7 ' > 0. (See Appendix |B] for a proof.) This 
equivalence can be seen intuitively as follows. As e —0, 
the contribution to the regularization term S^=i log(e+ l^^nl) 
from each non-zero Xn remains finite, while that from each 
zero-valued Xn approaches — 00 . Since we are interested in 
minimizing the regularization term, we get a huge reward for 
each zero-valued Xn, or—equivalently—a huge penalty for 
each non-zero Xn- 

To arrive at an £0 interpretation of the Co-Ll algorithm, 
we consider the corresponding optimization problem (fTSI) in 
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the limit that e —0. There we see that the regularization term 
0) from (|7]i yields Ld huge rewards when ||’S'(;a;|| i =0, 
or equivalently Ld huge penalties when || i 7^ 0, for each 
d G {1,... ,D}. Thus, we can interpret Co-Ll as attempting 
to solve the optimization problem which is a weighted 
version of the “£p^q mixed norm” problem from 1^ for p = 1 
and g —> 0. This establishes Part |2] of Theorem [1] 


E. Variational EM Interpretation of Co-Ll 

The variational expectation-maximization (VEM) algorithm 
II17I18I is an iterative approach to maximum-likelihood (ML) 
estimation that generalizes the EM algorithm from ll43l . We 
now provide a brief review of the VEM algorithm and describe 
how it can be applied to estimate A in (fTTTi . 

Eirst, note that the log-likelihood can be written as 


D. Bayesian MAP Interpretation of Co-Ll 
The MAP estimate ||4T] of X from y is 


a^MAP = argmaxp(a:|y) = argmin{ - logp(a;|y)} 

X X 

= argmin { — \ogp{x) — logp(y|a;)} 

X 


= argmin 

X 


\ogp{x)+y\\y 



(26) 

(27) 

(28) 


where (l26l l used the monotonicity of log, (l27l i used Bayes 
rule, and (l28l l used the AWGN likelihood. Note that, for real¬ 
valued AWGN with (7^ variance, 7 = 1 ^, while for circular 
complex-valued AWGN with variance, 7 = ^. 

Next, we derive the — logp(a;) term in (l28l l that re¬ 
sults from the hierarchical prior (l9ll- (fT0l i. Recall that, with 
shape parameter k and scale parameter 0, the Gamma 
pdf an is r(Ad;K,0) = lA,>oAr^0“''exp(-Ad/0)/r(K), 
where r(«;) is the Gamma function. Since r(Ad;K, 0) oc 
lM>oArf“^exp(-Ad/6'), we note that r(A(i;0,00) oc 
lAd>o/Ad, which is Jeffrey’s non-informative hyperprior ifTSl 
l42l for the Laplace scale parameter A^. Then, according to 
(S-dlOll, the prior equals 


p{x)= [ p{x\\)p{\)d\ 


(29) 


(30) 


TT {Ld - 1)! 

i},{2i\\^dxh+e))^‘‘ 


(31) 


which implies that 

D 

— logp(a;) = const -b ^Ldlog(||^da;||i+e). (32) 

d^l 

Equations (|28] t, (l32] t, and Qi imply 


logp(y;A)=y g(a;)logp(y; A)da; 

/■ . M \p{x,yA) q{x 
= / q{x)\og 


q{x) pix\y;X) 
p{x,y;X) 


da; 


= J g(a;) log da; -I- J q{x)\og 


q{x) 


pix\y;X) 


(34) 

(35) 
da;, 


= F{qixy,X) 


= D^iL{q{x)\\p{x\y; X)) 


(36) 


for an arbitrary pdf q{x), where Dm{q\\p) denotes the KL 
divergence of p from q. Because Diq_{q\\p) > 0 for any q and 
p, we see that F{q{x)-,X) is a lower bound on logp(y; A). 
The EM algorithm performs ML estimation by iterating 

q^*\x) = argmin DKi{q{x)\\p{x\y,X^*'>)) (37) 

A(‘+^) = argmaxF(g(‘)(a;); A), (38) 

where the “E” step dJTl i tightens the lower bound and the “M” 
step ( |38] | maximizes the lower bound. 

The EM algorithm places no constraints on q{x), in which 
case the solution to dJTl i is simply q^L(^x) = p{x\y; X^^f j e., 
the posterior pdf of x under A = A*^‘^. In many applications, 
however, this posterior is too difficult to compute and/or use 
in ( l38l l. To circumvent this problem, the VEM algorithm 
constrains q{x) to some family of distributions Q that makes 
(|J7]) - (l38]) tractable. 

Eor our application of the VEM algorithm, we constrain to 
distributions of the form 

g(a;) oc j^im exp (^ilogp(a;|t/; A)^, (39) 

which has the effect of concentrating the mass in q{x) at its 
mode. Plugging this q{x) and p{x,y;X) = p{y\x)p{x; X) 
into (l36l l. we see that the M step (l38]) reduces to 

A(‘+^^ = argmaxlogp(a;; A)| _ (j, (40) 

A la;— 

for - argmaxp(a;|y; A^*)), (41) 

where (|4TI) be interpreted as the E step. Eor the particular 
p{x-, A) in (fTTT i. we have that 


a;MAP = argmin7 ||y- ^x\\l -b (a;; 0). (33) 

X 

Einally, applying the MM algorithm to this optimization 
problem (as detailed in Sec. lITAl i. we arrive at Algorithm [T] 
We note that 1161 proposed to use Gamma and Jeffrey’s 
hyperpriors with MM for total-variation image deblurring, 
although their algorithm is not of the IRW-Ll form. This 
establishes Part |3] of Theorem [T] 


D 

logp{x] A) = const + [■^dlog(Ad) - Ad(||^da:||i +e)], 

d=l 

(42) 

and by zeroing the gradient w.r.t. A, we find that (l40l i becomes 

.(i-Pl) _ _ Ld 


A" 


l’®''i®MAp|ll + ^ 


d=l,...,D. (43) 
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Meanwhile, from < |28] t and (fTTT t, we find that (HTI t becomes 

D 

®MAP = argmin7||y- (44) 

® J 1 

a=l 

In conclusion, our VEM algorithm iterates the steps (l43l l- 
(l44l i. which match the steps in Algorithm [T] This establishes 
Part |4] of Theorem [T] 


E Co-Ll for Complex-Valued 

In Theorem [T] and Sections III-AHII-El real-valued analysis 
outputs ^dX were assumed for ease of explanation. We now 
extend the previous results to the case of complex-valued ^dX. 
Eor this, we focus on the VEM interpretation (recall Part |4] of 
Theorem [U, noting that a similar justification can be made 
based on the Bayesian MAP interpretation. In particular, we 
assume an AWGN likelihood and a complex-valued extension 
of the prior (fTTli : 

D / \ \ 

p(a;; A) oc f ^ j exp (-Ad(||’S'da:||i + e)), (45) 

d=l 

which, when e = 0, is i.i.d. complex-valued Laplacian on 
Zd = 'i'dX € with deterministic scale parameter Xd > 0. 
To show this, we follow the steps in Sec. III-EI up to the log- 
prior in (l42l i. which now becomes 

D 

logp{x; A) = const -f E [2Ldlog(Ad)-Ad(||^da;||i + e)]. 

(46) 


Zeroing the gradient w.r.t. A, we find that the VEM update in 
(|40] | becomes 


A 


(i-tl) 


2Ld 




d=l,...,D, 


(47) 


which is twice as large as the real-valued case in (|4^ . 


G. New Interpretations of the IRW-Ll Algorithm 

The proposed Co-Ll algorithm is related to the analysis- 
CS formulation of the well-known IRW-Ll algorithm a. 
Eor clarity, and for later use in Sec. |III] we summarize this 
latter algorithm in Algorithm|2] and note that the synthesis-CS 
formulation follows from the special case that IP = /. 


Algorithm 2 The IRW-Ll Algorithm 


1 : input: 'S' = ..., , S', y, 7 > 0, e > 0 

2 : initialization: = I 

3 : for f = 1,2,3,... 

4: •<— argmin 7 ||y — + ||W^*^’S'a;||i 


5: ^ diag 

6 : end 


1 

e 4- \xl}[xd) \ ’ 


1 

e -I- \%lj\x^'^'i\ 


7: output: a;(*) 


Comparing Algorithm |2] to Algorithm [T] we see that IRW- 
Ll coincides with real-valued Co-Ll in the case that every 
sub-dictionary 'S'^ has dimension one, i.e., Cd = 1 = LdVd 
and D = L, where L = denotes the total number 

of analysis coefficients. Thus, the Co-Ll interpretations from 
Theorem [U can be directly translated to IRW-Ll as follows. 

Corollary 2 (IRW-Ll). The IRW-Ll algorithm from Algo¬ 
rithm has the following interpretations: 

1) MM applied to dJ]) under the log-sum penalty 

L 

Rfsix-,e) =^\og{€+\ iIjJx\), (48) 

recalling the definition of from 0, 

2) as e ^ 0, MM applied to (O under the Iq penalty 

L 

Roi^)=T.h^I^\>o^ (49) 

3) MM applied to Bayesian MAP estimation under an 
AWGN likelihood and the hierarchical prior 

fi \ 

p{x\\) = n y (50) 

A~ i.i.d. r(0,e"^) (51) 

where Zi = ipjx is Laplacian given A;, and Aj is Gamma 
distributed with scale parameter and shape pa¬ 
rameter zero, which becomes Jejfrey’s non-informative 
hyperprior p{Xi) oc lAi>o/^i when e = 0. 

4) variational EM under an AWGN likelihood and the prior 

p{x;X) cx Y[^exp{-Xi{\ipjx\+e)). (52) 

1=1 

which, when e = 0, is independent Laplacian on 
2; = 'S'a: € K.^ under the positive deterministic scale 
parameters in A. 

While Part [T] and Part |2] of Corollary |2] were established 
for the £2-constrained synthesis-CS formulation of IRW-Ll in 
ifTSll . we believe that Part [3 and Part 0] are novel interpretations 
of IRW-Ll. 


III. The Co-IRW-Ll algorithm 
We now propose the Co-IRW-Ll-e algorithm, which is 
summarized in Algorithm |3] Co-IRW-Ll-e can be thought 
of as a hybrid of the Co-Ll and IRW-Ll approaches from 
Algorithms [U and m respectively. Like with Co-Ll, the Co- 
IRW-Ll-e algorithm uses sub-dictionary dependent weights Xd 
that are updated at each iteration t using a sparsity metric on 
S'da;^^. But, like with IRW-Ll, the Co-IRW-Ll-e algorithm 
also uses diagonal weight matrices that are updated 

at each iteration. As with both Co-Ll and IRW-Ll, the 
computational burden of Co-IRW-Ll-e is dominated by the 
L2-PL1 minimization problem in line 0] of Algorithm [S] which 
is readily solved by existing techniques like MLISTA. 

THE Co-IRW-Ll-e algorithm can be interpreted in various 
ways, as we detail below. Lor clarity, we hrst consider fixed 
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Algorithm 3 The Real-Valued Co-IRW-Ll-e Algorithm 
1 : input: y, 7 > 0, ed > 0 Vd, e > 0, 

2 : initialization: = 1 Vd, = IMd 

3: for i = 1,2,3,... 


D 


4: ^ argmin7||y — $a;||f + ^ A^* ||wj;* \Irda;||] 

Ld 


d=l 


5: A 


(i+1) 


^^l0g(l+£ + 


Z=1 


Cd 


-1 


1 , 


6: ^ diag 


7: end 

8: output: £cd) 


1 


ed(l + e) + I 

1 


ed(l + e) + \'tpd,Ld^ 


(t)| 


, Vd 


regularization parameters e = [ei,..., and later, in 
Sec. IIII-Fl we describe how they can be adapted at each iter¬ 
ation, leading to the Co-IRW-Ll algorithm. Also, to simplify 
the development, we first consider the real-valued case and 
discuss the complex-valued case later, in Sec. IIII-Gl 

Theorem 3 (Co-IRW-Ll-e). The real-valued Co-IRW-Ll-e 
algorithm in Algorithm^has the following interpretations: 

1) MM applied to (O under the log-sum-log penalty 

D Ld 


Ri,i{x;e,e) = X! X! 


d^l Z=1 
Ld 


(ed(l + £) + \tpd^ix\) 


^ log ( 1 -b e -b 


i=l 


id 


(53) 


2) fli £ —>■ 0 and ed —^ 0 Vd, MM applied to (O under the 
^0 + ^0,0 penalty 

D 

-C'd l||^'da!||o>07 (54) 


i?o-oo(a;) ^ llvhatllo 


3) MM applied to Bayesian MAP estimation under an 
AWGN likelihood and the hierarchical prior 

L) Ld ^ / \nuL ^|\ + 


p{x 


l^d \ / 


Z=1 

D 


l + £ + 


\'^dM 

Ed 


(55) 


P(A) = I]^p(Ad), p(Ad) oc I ^ , (56) 


d=l 


0 else 


where, when £ = 0, the variables ^dX £ 

are i.i.d. generalized-Pareto MS given Ad, and p{\d) 
is Jeffrey’s non-informative hyperprior MSHIV for the 
random shape parameter Ad- 

4) variational EM under an AWGN likelihood and the prior 


D Ld 


p{x-X,e) oc 


d=l 1=1 


Ad — 1 
2ed 


l-b£-b 


Cd 


— Ad 


( 57 ) 


where, when £ = 0, the variables Zd = ^dX £ 
are i.i.d. generalized-Pareto with deterministic shape 
parameter Ad > 1 and scale parameter Cd > 0. 

Proof: See Sections IlII-AI to [lII-EI below. ■ 

As with Co-Ll, the MM interpretation implies convergence 
(in the sense of an asymptotic stationary point condition) when 
£ > 0, as detailed in Sec. nnn 


A. Log-Sum-Log MM Interpretation of Co-IRW-Ll-e 
Consider the optimization problem 

argmin7||y - $a:|2-b i?isi(a;;e,£) (58) 

X 

with i?isi defined in ( |53] ). We attack this optimization problem 
using the MM approach detailed in Sec. III-AI The difference 
is that now the function g 2 is defined as 


92{v) 


D 


= ^ ^ log (ed(l -b £) -b Ufe) ^ log f 1 -b £ -b — 
d=l keKd ^ i&ICd ^ 


(59) 


D 


= E 


Ldlog E + ^ + “ 


i&K-d 


+ E log(ed(l-b£)-bt;fc) 

k^lCd 

which has a gradient of 


(60) 


(61) 


/ 


^cZ(fc) 




-bi 


/ 


ed(fe)(l + e) + Wfc 


(i) 


(62) 


when d{k) ^ 0 and otherwise [Vp2('W*'*^)]fc = 0. Thus, 
recalling (fTTl) . MM prescribes 


D 


( 


.y(i-i-i) _ argmin^^ 


d—1 keKd 

Vk 


Ld 


\ 


-bl 


, ed(l -b £) -b 


V log(l + e+^) 

XietCd ^ / 

^ ) +l\\v- [0 ^]u||2, (63) 


or equivalently 

= argminy]y]A 


= aro" min 

cZ=l Z=1 


\'^\lX\ 


ed(l -b£) -b 


-b7||y- ^£c|| 


(64) 


for 


A 


(t-Hi) _ 


Ld 


-y:iog(i+.+ 


Z=1 


T.rr(0| 




Cd 


-1 


+ 1, (65) 


which coincides with Algorithmic This establishes Part [T] of 
Theorem [C 














































B. Convergence of Co-IRW-Ll-e 

The convergence of Co-IRW-Ll-e (in the sense of an 
asymptotic stationary point condition) for e > 0 can be shown 
using the same procedure as in Sec. III-BI To do this, we 
only need to verify that the gradient Vp2 in (IhTl i is Lipschitz 
continuous when e > 0, which we do in Appendix lO 


C. Approximate £q + fo,o Interpretation of Co-IRW-Ll-e 

Recalling the discussion in Sec. III-CI we now consider the 
behavior of the i?isi(a;; e, s) regularizer in (l5^ as e —0 and 
Ed ^ 0 Vd. For this, it helps to decouple (l53l l into two terms: 


D La 


e, e) = XIXI + e) + 


( 66 ) 


Z=1 
D Ld 


EE log E ^^+ 






Ed 


Defining Qd = Qd,i changing the variable of 

integration to = XdQd, we find 


D 


p{x) oc 


3 Qd 


cZ=l 


{2edQd)^‘‘ Jo 


dTd (70) 


oc 


(id-1)! 

D 

HUdEEiog(i+e + ^) 


n 


1 


1 = 1 1 + £ + 




D La 

nn 

Ld 


^ed(l + e) + \ipJiix\^ 


i=l 


X log 1 + e + 


\Wd^iX\ 

ed 


0 -1 


which implies that 

— logp(a;) = const + R\a[x-, e, e) 


(71) 


(72) 


(73) 


As Cd —0 Vd, the first term in (|6^ contributes an infinite 
valued “reward” for each pair (d, 1) such that Ir/?^ ix\ = 0, 
or a finite valued cost otherwise. As for the second term, we 
see that lime^o.ed^o log (l + e + l'0d.*a;|/ed) = 0 if 
and only if |■0^ja;| = 0 Vi G i-e., if and only 

if Il’S'dtcllo = 0. And when ||4'd£c||o = 0, the second term 
in (l6^ contributes Ld infinite valued rewards. In summary, 
as e 0 and 0 Vd, the first term in (l66l l behaves like 
||\I/a;||o and the second term like the weighted ^o.o quasi-norm 
'L!d=i^dX\\’^ax\\o>ti^ Stated in (l54l i. This establishes Part|2] 
of Theorem [3 


D. Bayesian MAP Interpretation of Co-IRW-Ll-e 


for Rm{x-, e, e) defined in (l5Tt . 

Plugging (|73 i into (|28]) . we see that 

atMAP =argmin7||y-$a;||f+ i?isi(a;;e,£), (74) 

X 

which is equivalent to the optimization problem in (l5^ . We 
showed in Sec. IIII-AI that, by applying the MM algorithm 
to (fSSl) . we arrive at Algorithm [3 This establishes Part [3 of 
Theorem [3 

E. Variational EM Interpretation of Co-IRW-Ll-e 

To justify the variational EM (VEM) interpretation of Co- 
IRW-Ll-e, we closely follow the approach used for Co-Ll in 
Sec. III-EI The main difference is that now the prior takes the 
form of p{x\ A, e) from (fSTl i. Thus, (l42li becomes 


To show that Co-IRW-Ll-e can be interpreted as Bayesian 
MAP estimation under the hierarchical prior (ISSTl-dShll, we first 
compute the prior p{x). To start. 


p{x)= [ p(A)p(a;|A)dA 

JrE} 


(67) 



(^1 + £ + 


- dAd. 

ed J 

( 68 ) 


Writing (1 -f £ -f \xpl ix\/ed) = exp(-(Ad -f l)Qd,i) 

for Qd,i = log(l -f £ -f \ipl ix\/€d), we get 


logp(a;; A,e) 
L) Ld 

=xx 

L 

+ const 


log 


Ad — 1 
ed 


- Ad log 1 -I- £ -I- 


\'tPd.ix\ 

ed 


(75) 


and by zeroing the gradient w.r.t. A we see that the M step 
(143]) becomes 


v(‘-Pl) 


1 1 , 

- = — log 

- 1 Ld 


1 -f £-f 




MAPI 


ed 


d = 1, 


,D, 


(76) 

where again denotes the MAP estimate of x under A = 
a 1*1. Prom (12^ and (iSTl i. we see that 


D 

p{x) oc 

d=l 


1 

[2ed)^^ 



\La — t 


g (.M+i)T.iAQd,i 


D La 

®MAP =argminXl^d^X^°s(l^4(®l +^<^(1 + ^)) 

* d^l 

y\\y - ^x\\l, 


(69) 


(77) 
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which (for e = 0) is a -weighted version of the IRW-Ll 
log-sum optimization problem (recall Part [T] of Corollary |2]i. 
To solve ( ITTT i. we apply MM. With a small modification of the 
MM derivation from Sec. III-AI we obtain the 2-step iteration 




D 


a^MAP = argniin 7 ||y- + ^ 

(78) 




= diag 


1 


ed(l-f e)-f |r/Jd,ia;W|’ 
1 


ed(l -I- e) -f I'j/’J 


(i)| 


(79) 


By using only a single MM iteration per VEM iteration, the 
MM index “i” can be rewritten as the VEM index “t,” in which 
case the VEM algorithm becomes 


D 


X 


(t) _ 


= argmin 7 ||y - ^x\\l + ^ || 




= diag 


ed(l-f e)-f 
1 


ed(l + e) + |’/’d.Ld® 


(‘)| 


At+i) 


Ld 


log fl 


-f e-f 


ed 


,Vd 

-1 


(80) 


(81) 


+ 1, Vd, 

(82) 


which matches the steps in Algorithm|3] This establishes Part|4] 
of Theorem [3 


F. Co-IRW-Ll 


Algorithm 4 The Co-IRW-Ll Algorithm 

1 : input: y, 7 > 0, e > 0 

2 : if 'S'a; G use A= (1, oo) and logp(a;; A, e) from f75l : 
if ^x G C^, use A= (2, oo) and logp(a:; A, e) from j54l . 

3: initiaiization: = 1 Vd, = I Wd 

4: for f = 1,2,3,... 

D 

5: a;0) ^argmin7||y-«>a;||^-f 

® J 1 

d—1 

6: inax logp(at(*^ A, e), 

XdGA,ed>0 

d=l,...,D 


7: ^ diag 


8: end 

9: output: a;0) 


ed’^^^(l + e) + 

+ e) + \^p\L^xW\}' 


Vd 


to the case of complex-valued '^dX. Eor this, we focus on 
the Co-IRW-Ll algorithm, since Co-IRW-Ll-e follows as the 
special case where e is fixed at a user-supplied value. 

Recalling that Co-IRW-Ll was constructed by generalizing 
the VEM interpretation of Co-IRW-Ll-e, we reconsider this 
VEM interpretation for the case of complex-valued ^dX. In 
particular, we assume an AWGN likelihood and the following 
complex-valued extension of the prior (l57l) : 


D La 


p{x;X,e) oc 




(Ad-l)(Arf-2) 

27reH 


l + e + 


\'tPd.ix\ Y^'‘ 

ed ) 

(83) 


Until now, we have considered the Co-IRW-Ll-e parameters 
e = [ei,..., cdY to be fixed and known. But it is not clear 
how to set these parameters in practice. Thus, in this section, 
we describe an extension of Co-IRW-Ll-e that adapts the e 
vector at every iteration. The resulting procedure, which we 
will refer to as Co-IRW-Ll, is summarized in Algorithm |4] 
Although there does not appear to be a closed-form solution 
to the joint maximization problem in line |6] of Algorithm H) it 
is over two real parameters and thus can be solved numerically 
without a significant computational burden. 

Algorithm 0] can be interpreted as a generalization of the 
VEM approach to Co-IRW-Ll-e that is summarized in Part|4] 
of Theorem[3and detailed in Sec. lIII-El Whereas Co-IRW-Ll- 
e used VEM to estimate the A parameters in the prior ( fSTl ) for 
a hxed value of e, Co-IRW-Ll uses VEM to jointly estimate 
(A, e) in (fSTl) . Thus, Co-IRW-Ll can be derived by repeating 
the steps in Sec. lEH except that now the maximization 
of logp(a;; A, e) in (fTSl) is performed jointly over (A, e), as 
reflected by line |6] of Algorithm H) 

G. Co-IRW-Ll for Complex-Valued '^dX 

In Sections IIII-AlHim the analysis outputs ^dX were 
assumed to be real-valued. We now extend the previous results 


which (for e = 0) is i.i.d. generalized-Pareto on Zd = 
^dX G with deterministic shape parameter Xd > 2 and 
deterministic scale parameter Cd > 0. In this case, the log-prior 
(fTSl l changes to 


logp(at; A,e) 


D La 

= const -t- EE 

d=l 1=1 



- Xd log 


(^1-f e-f 


I'tphxl 

ed 


md-2)\ 

) 

(84) 


which is then maximized over (A, e) in line|6]of Algorithm|4] 


IV. Numerical Results 

We now present results from a numerical study into the 
performance of the proposed Co-Ll and Co-IRW-Ll meth¬ 
ods, given as Algorithm [T] and Algorithm 0] respectively. 
Three experiments are discussed below, all of which focus 
on the problem of recovering an 7V-pixel image (or image 
sequence) x from M-sample noisy compressed measurements 
y — ^x -f w, with M N. In the first experiment, we 
recover synthetic 2D finite-difference signals; in the second 
experiment, we recover the Shepp-Logan phantom and the 
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Cameraman image; and in the third experiment, we recover 
dynamic MRJ sequences, also known as “cines.” 

As discussed in Sec. IkDl Co-Ll can be considered as the 
composite extension of the standard LI-regularized approach 
to analysis CS, i.e., (|2]) under the non-composite LI regularizer 
R{x) = Similarly, Co-IRW-Ll can be considered 

as the composite extension of the standard IRW approach to 
the same problem. Thus, we compare our proposed composite 
methods against these two non-composite methods, referring 
to them simply as “LI” and “IRW-Ll” in the sequel. 

A. Experimental Setup 

For the dynamic MRI experiment, we constructed $ using 
randomly sub-sampled Fourier measurements at each time 
instant with a varying sampling pattern across time. More 
details are given in Sec. IIV-DI For the other experiments, 
we used a “spread spectrum” operator of the form 
$ = DFC, where C G is diagonal matrix with i.i.d 

equiprobable ±1 entries, F G is the discrete Fourier 

transform (DFT), and D G ^ row-selection operator 

that selects M rows of FC G uniformly at random. 

In all cases, the noise w was zero-mean, white, and circular 
Gaussian (i.e., independent real and imaginary components of 
equal variance). Denoting the noise variance by we define 
the measurement signal-to-noise ratio (SNR) as ||j/|| 2 /(M(t^) 
and the recovery SNR of signal estimate x as ||a;||2/||a; —Sjl^. 

Note that, when x is real-valued, the measurements y will 
be complex-valued due to the construction of Thus, to 
allow the use of real-valued LI solvers, we split each complex¬ 
valued element of y (and the corresponding rows of $ and 
w) into real and imaginary components, resulting in a real- 
only model. However, to avoid possible redundancy issues 
caused by the conjugate symmetry of the noiseless Fourier 
measurements FCx, we ensured that D selected at most one 
sample from each complex-conjugate pair. 

We used MFISTA to implement the L2 h-L 1 opti¬ 
mization needed for all methods. The maximum number of 
outer, reweighting iterations for Co-Ll and Co-IRW-Ll was 
set to 16, while the maximum number of inner MFISTA 
iterations was set at 60, with early termination if — 

< 1 X 10“®. In all experiments, we used 
7 = 1/cr^ (as motivated before (|6]l) and e = 0 = e. 

B. Synthetic 2D Finite-Difference Signals 

Our first experiment aims to answer the following question. 
If we know that the sparsity of ’S'i a; differs from the sparsity of 
^ 2 X, then can we exploit this knowledge for signal recovery, 
even if we don’t know how the sparsities are different? This 
is precisely the goal of composite regularizations like (|4]i. 

To investigate this question, we constructed 2D signals with 
finite-difference structure in both the vertical and horizontal 
domains. In particular, we constructed X = atil^ -f la;^, 
where both Xi G and X 2 G are finite-difference 
signals and 1 G R"^® contains only ones. The locations of the 
transitions in ati and a; 2 were selected uniformly at random 
and the amplitudes of the transitions were drawn i.i.d. zero- 
mean Gaussian. The total number of transitions in Xi and X2 




Fig. 1: Examples of the 2D finite-difference signal X used in the first 
experiment. On the left is a realization generated under a transition 
ratio of q: = 14/14 = 1, and on the right is a realization generated 
under a = 27/1 = 27. 


was fixed at 28, but the ratio of the number of transitions in 
Xi to the number in X2, denoted by a, was varied from 1 
to 27. The case a = 1 corresponds to X having 14 vertical 
transitions and 14 horizontal transitions, while the case a = 27 
corresponds to X having 27 vertical transitions and a single 
horizontal transition. (See Fig. [T] for examples.) Finally, the 
signal X G R^ appearing in our model ([T]i was created by 
vectorizing X, yielding a total of X = 48^ = 2304 pixels. 

Given x, noisy observations y = ^x -f w were generated 
using the random “spread spectrum” measurement operator $ 
described earlier at a sampling ratio of M/N = 0.25, with 
additive white Gaussian noise (AWGN) w scaled to achieve 
a measurement SNR of 40 dB. All recovery algorithms used 
vertical and horizontal finite-difference operators \Fi and 
respectively, with (F = in the non-composite case. 

Figure |2] shows recovery SNR versus a for the non¬ 
composite LI and IRW-Ll techniques and our proposed Co-Ll 
and Co-IRW-Ll techniques. Each SNR in the figure represents 
the median value from 25 trials, each using an independent 
realization of the triple (^,x,w). The figure shows that the 
recovery SNR of both LI and IRW-Ll is roughly invariant to 
the transition ratio a, which makes sense because the overall 
sparsity of \Fa: is fixed at 28 transitions by construction. In 
contrast, the recovery SNRs of Co-Ll and Co-IRW-Ll vary 
with a, with higher values of a yielding a more structured 
signal and thus higher recovery SNR when this structure is 
properly exploited. 

C. Cameraman and Shepp-Logan Recovery 

For our second experiment, we investigate algorithm per¬ 
formance versus sampling ratio M/N when recovering the 
well-known Shepp-Logan phantom and Cameraman images. 
In particular, we used the W = 96 x 104 cropped real-valued 
Cameraman image and the = 96 x 96 complex-valued 
Shepp-Logan phantom shown in Fig. |2 and we constructed 
compressed noisy measurements y using spread-spectrum $ 
and AWGN to at a measurement SNR of 30 dB in the 
Cameraman case and 40 dB in the Shepp-Logan case. 

For the Cameraman image, we constructed the analysis 
operator ’S' G concatenating undecimated dbl 

and db2 2D wavelet transforms (UWT-dbl-db2) with one 
level of decomposition. For the Shepp-Logan phantom image, 
we constructed the analysis operator ’S' G fj-om the 
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Fig. 2: Recovery SNR versus transition ratio a for the first ex¬ 
periment, which used 2D finite-difference signals, spread-spectrum 
measurements at M/N = 0.25, AWGN at 40 dB, and finite- 
difference operators for ^d- Each recovery SNR represents the 
median value from 25 independent trials. 



Fig. 3: Left: the real-valued cropped Cameraman image of size N = 
96 X 104. Right: the complex-valued Shepp-Logan phantom of size 
= 96 X 96. For the Shepp-Logan phantom, the real and imaginary 
parts of X were identical, and only the real part is shown here. 


undecimated dbl 2D wavelet transform (UWT-dbl) with one 
level of decomposition. The Co-Ll and Co-IRW-Ll algorithms 
treated each of the sub-bands of the wavelet transform as a 
separate sub-dictionary iFd in their composite regularizers. 

Fig. a shows recovery SNR versus sampling ratio M/N 
for the Cameraman image, while Fig. |5] shows the same for 
the Shepp-Logan phantom. Each recovery SNR represents the 
median value from 7 independent realizations of w). Both 
figures show that Co-Ll and Co-IRW-Ll outperform their non¬ 
composite counterparts, especially at low sampling ratios; the 
gap between Co-IRW-Ll and and IRW-Ll closes at M/N > 
0.35 for the Shepp-Logan phantom. 

D. Dynamic MRI 

For our third experiment, we investigate a simplified ver¬ 
sion of the “dynamic MRI” (dMRI) problem. In dMRI, one 
attempts to recover a sequence of MRI images, known as an 
MRI cine, from highly under-sampled “k-t-domain” measure¬ 
ments constructed as 

y^ = ^txt+wt, (85) 



Fig. 4; Recovery SNR versus sampling ratio M/N for the cropped 
Cameraman image. Measurements were constructed using a spread- 
spectrum operator and AWGN at 30 dB SNR, and recovery used 
UWT-dbl-db2 at one level of decomposition. Each SNR value 
represents the median value from 7 independent trials. 



sampling ratio M/N 


Fig. 5: Recovery SNR versus sampling ratio M/N for the Shepp- 
Logan phantom. Measurements were constructed using a spread- 
spectrum operator and AWGN at 40 dB SNR, and recovery used 
UWT-dbl at one level of decomposition. Each recovery SNR repre¬ 
sents the median value from 7 independent trials. 


where Xt £ is a vectorized (A^i x Ai2)-pixel image at 

time t, £ ]^MixNiN 2 ^ sub-sampled Fourier operator 
at time t, and Wt £ is AWGN. This real-valued is 
constructed from the complex-valued N 1 N 2 x iViiV2 2D DFT 
matrix by randomly selecting 0.5Mi rows and then splitting 
each of those rows into its real and imaginary components. 
Here, it is usually advantageous to vary the sampling pattern 
with time and to sample more densely at low frequencies, 
where most of the signal energy lies (e.g., Il46ll ). Putting dSSl) 
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Fig. 6; Left: A 144 x 85 spatial slice from the 144 x 85 x 48 
dMRI dataset. Middle: The 144 x 48 spatio-temporal slice used for 
the dMRI experiment. Right: a realization of the variable-density k- 
space sampling pattern, versus time, at M/N = 0.30. 


into the form of our measurement model ([Til, we get 


'Vi 


■$i 


~Xi' 


'wi~ 






+ 


JJt- 


^T_ 


_Xt_ 


_Wt_ 



with total measurement dimension M = MiT and total signal 
dimension N = N 1 N 2 T. 

As ground truth, we used a high-quality dMRI cardiac cine 
X of dimensions = 144, N 2 = 85, and T = 48. The left 
pane in Fig. |6]shows a 144 x 85 image from this cine extracted 
at a single time t, while the middle pane shows a 144 x 48 
spatio-temporal profile from this cine extracted at a single 
horizontal location. This middle pane shows that the temporal 
dimension is much more structured than the spatial dimension, 
suggesting that there may be an advantage to weighting the 
spatial and temporal dimensions differently in a composite 
regularizer. 

To test this hypothesis, we constructed an experiment where 
the goal was to recover the 144 x 48 spatio-temporal profile 
shown in the middle pane of Fig. |6] as opposed to the full 
3D cine, from subsampled k-t-domain measurements. For this 
purpose, we constructed measurements {y}JLi as described 
above, but with iV2 = 1 (and thus a ID DFT), and used 
a variable density random sampling method. The right pane 
of Fig. |6] shows a typical realization of the sampling pattern 
versus time. Finally, we selected the AWGN variance that 
yielded measurement SNR = 30 dB. 

For the non-composite LI and IRW-Ll algorithms, we 
constructed the analysis operator ’S' € '^^nxn ^ vertical 
concatenation of the dbl-db3 orthogonal 2D discrete wavelet 
bases, each with two levels of decomposition. For the Co- 
L1 and Co-IRW-Ll algorithms, we assigned each of the 21 
sub-bands in S' to a separate sub-dictionary 
Note that the sub-dictionary size Ld decreases with the level 
in the decomposition. By weighting certain sub-dictionaries 
differently than others, the composite regularizers can exploit 
differences in spatial versus temporal structure. 



sampling ratio M/N 

Fig. 7: Recovery SNR versus sampling ratio M/N for the dMRI 
experiment. Each SNR value represents the median value from 7 
independent trials. Measurements were constructed using variable- 
density sub-sampled Fourier operator and AWGN at 30 dB measure¬ 
ment SNR, and recovery used a concatenation of dbl-db3 orthogonal 
2D wavelet bases at two levels of decomposition. 



LI C 0 -LI 


IRW-L1 C 0 -IRW-LI 


Fig. 8: Recovered dMRI spatio-temporal profiles at M/N = 0.30 


Fig. [7] shows recovery SNR versus sampling ratio M/N for 
the four algorithms under test. Each reported SNR represents 
the median SNR from 7 independent realizations of ($,iu). 
The figure shows that Co-Ll outperforms its non-composite 
counterparts at all tested values of M/N, while Co-IRW-Ll 
outperforms its noncomposite counterpart for M/N < 0.4. 
Although not shown here, we obtained similar results with 
other cine datasets and with an UWT-dbl-based analysis 
operator. 

For qualitative comparison, Fig.j^shows the spatio-temporal 
profile recovered by each of the four algorithms under test at 
M/N = 0.3 for a typical realization of (4>,tu). Compared 
to the ground-truth profile shown in the middle pane of 
Fig. [6] the profiles recovered by LI and IRW-Ll show visible 
artifacts that appear as vertical streaks. In contrast, the profiles 
recovered by Co-Ll and Co-IRW-Ll preserve most of the 
features present in the ground-truth profile. 
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Shepp-Logan 

Cameraman 

MRI 

Ll 

8.12 

9.88 

22.0 

Co-Ll 

8.83 

12.8 

21.7 

IRW-Ll 

7.95 

12.7 

24.1 

Co-IRW-Ll 

9.29 

16.9 

29.6 


TABLE I; Computation times (in seconds) for the presented exper¬ 
imental studies. The times are averaged over trial runs and different 
sampling ratios. 

E. Algorithm Runtime 

Table |I] reports the average runtimes of the LI, Co-Ll, 
IRW-Ll, and Co-IRW-Ll algorithms for the experiments in 
Sections [IV-CI and IIV-DI There we see that the runtime of Co- 
Ll was 1.29x that of LI for the worst case, and the runtime 
of Co-IRW-Ll was 1.33x that of IRW-Ll for the worst case. 

E Choice of Dictionary 

In our last experiment, we investigate the performance of 
Co-IRW-Ll versus choice of For this, we constructed 

using a concatenation of either undecimated or or¬ 
thogonal 2D Daubechies wavelet transforms, and we varied 
both the number of transforms in the concatenation as well 
as the number of levels in the wavelet decomposition. We 
then attempted to recover the Cameraman image from spread- 
spectrum measurements at M/N = 0.4 in AWGN at 30 dB 
SNR. As usual, the Co-IRW-Ll algorithm treated each wavelet 
sub-band as a separate sub-dictionary. 

The recovery SNR for various choices of is shown in 
Fig. |9] For the case of orthogonal wavelet transforms (OWT), 
a significant performance improvement was observed in going 
from one to two transforms, regardless of the wavelet decom¬ 
position level. However, a slight performance degradation was 
observed when concatenating more than two OTWs. Moreover, 
the effect of varying the level of decomposition was mild 
unless no concatenation (i.e., dbl) was used. For the undeci¬ 
mated wavelet transform (UWT) case, the recovery SNR was 
essentially invariant to both the level of decomposition and 
the number of concatenated transforms, with only a slight 
degradation when five transforms were concatenated. Overall, 
the UWT performed significantly better than the OWT. Similar 
trends were observed for the Co-Ll algorithm in experiments 
not shown here. 

V. Conclusions 

Motivated by the observation that a given signal x admits 
sparse representations in multiple dictionaries but with 
varying levels of sparsity across dictionaries, we proposed 
two new algorithms for the reconstruction of (approximately) 
sparse signals from noisy linear measurements. Our first algo¬ 
rithm, Co-Ll, extends the well-known lasso algorithm Engl 
from the LI penalty ||’S'a:||i to composite LI penalties of the 
form dUl while self-adjusting the regularization weights A^. 
Our second algorithm, Co-IRW-Ll, extends the well-known 
IRW-Ll algorithm 09I12I13I to the same family of composite 
penalties while self-adjusting the regularization weights \d and 
the regularization parameters Cd- 

We provided several interpretations of both algorithms: i) 
majorization-minimization (MM) applied to a non-convex log- 
sum-type penalty, ii) MM applied to an approximate ^o-type 



Fig. 9: Co-IRW-Ll recovery SNR for different choices of Afd- 
Measurements were constructed from the cropped cameraman image 
using a spread-spectrum operator, AWGN at 30 dB SNR, and 
sampling ratio M/N = 0.40. Here, OWT represents a concatenation 
of 2D orthogonal Daubechies wavelet transforms, UWT represents 
a concatenation of 2D undecimated Daubechies wavelet transforms, 
and “Ivl” denotes the level of decomposition. Each SNR value 
represents the median value from 3 independent trials. 


penalty, iii) MM applied to Bayesian MAP inference under a 
particular hierarchical prior, and iv) variational expectation- 
maximization (VEM) under a particular prior with deter¬ 
ministic unknown parameters. Also, we leveraged the MM 
interpretation to establish convergence in the form of an 
asymptotic stationary point condition ifT^ . Furthermore, we 
noted that the Bayesian MAP and VEM viewpoints yield 
novel interpretations of the original IRW-Ll algorithm. Finally, 
we present a detailed numerical study that suggests that our 
proposed algorithms yield significantly improved recovery 
SNR when compared to their non-composite LI and IRW-Ll 
counterparts with a modest (e.g., 1.3x) increase in runtime. 
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Appendix A 

LIPSCHITZ continuity OE Co-Ll GRADIENT 

In this appendix, we establish the Lipschitz continuity of 
Vp2 from (fT^ in the case that e > 0. We first recall that, for 
Vp2 to be Lipschitz continuous over the domain v G C, there 
must exist some constant (3 such that, for all v^v' G C, 

\\yg2{v)-Vg2{v')\\l</3\\v-v'\\l (87) 
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From < fT8l ), we have 


\\Vg2{v)-Va2{v')\\l 

L 

-E 


L 


L 


< d { k ) 


L, 


' d { k ) 


-E 


'i^K.d(k) 


Vi 




u- 

'ieKd(fc) » 


= E 


fc=i le + E.gKdf.) + E.gKdf.) ^0' 


£> id 

d=l i=l (e + EigK^ffc) ^0 (e + EiG/Cd(fc) ^i) 

We can then upper bound the latter as follows. 

\\^92{.v)-^g2{v')\\l< 

d=l 1^1 ^ 


D r Ld 

d—1 ^ 2—1 

D ^4 Ld 


n 2 


sE^Ew..-“4.f 

d=l i=l 

/c^l 

4 L+AT 

E 


^ -^max 


fc=l 




1 


AT 


log(l/e) 

1 


E 


log(l/e 
= ||at||o-A^ + 


E l0g(£)+ E l0g(e+|2;n|) 

n:Xn—0 niXrt^O 

E„:x„ 5 ^o log(e+ |a;„|) 


log(l/e) 


in dal, we have 


N 


\y-^x\\l + ^log(e+ |x„|) 


( 88 ) 

(89) 

(90) 


n—1 


OC 


7 


log(l/e) 


A / 

= 7 


\y-^x\\l + ||a:||o -N + 


E log(e+|a;„|) 

n: Xn^O 


log(l/e) 


( 100 ) 


Clearly the global scaling and offset by N in (1 100b are 
inconsequential to the minimization in (l24b . Furthermore, by 
making e > 0 arbitrarily small, we can make the last term 
in (1 100b arbitrarily sma ifl and thus negligible compared to 
the other terms. It is in this sense that we say that (l24b is 
equivalent to (l25T l as e —0. 


(91) 

(92) 

(93) 

(94) 

(95) 

(96) 


Appendix C 

Lipschitz continuity of Co-IRW-Ll-e gradient 

In this appendix, we establish the Lipschitz continuity of 
Vp2 from dMT l in the case that e > 0, recalling the Lipschitz 
definition dSTl) . To ease the exposition, we focus on the L = 1 
case, noting that a similar (but more tedious) technique can 
be applied to the general case. 

From the L = 1 case of dMl l. we have 

\Wg2{v)-Wg2{v')\^ 

1 


where dMI ) follows from the fact that Ud.i > 0 Vd, I (according 
to (fT3]) l. ( |9^ follows from the fact that ||a;||i < •\/V||a:||2 for 
X G C^, and ( l94b uses Lmax — max^j Ld- Comparing ( l96b to 
dSTb . we see that Vp2 from ( fTSl l is Lipschitz continuous. 


Appendix B 

Equivalence of Log-Sum and 4 Minimization 

In this appendix, we establish that the log-sum optimization 
(l24b becomes equivalent to the £q optimization (l25T l as e —0. 
We first note that, for any e > 0, 


l0g(l -f £-f 

1 


+ 1 


ei(l -f e) -f ' 


1 2 


ei(l -I- e) -f v ' 


(97) 

(98) 

(99) 


,log(l -f e-f 

= [A + Br 

< [\A\ + \B\]^ <2[A^ +B^], 
since ||a;||i < for x G C^, and where 

1 A 1_ 1 

ei(l-f £)-f u ei(l-I-£)-I-u' 

5 A _i_ 

(ei (1 -f £) -f v) log(l + £ + ^) 

1 

(ei(l -f £) -f v') log(l +e+^) 
Examining A^, we find that 

d2 ^ 1 1 " 

A^ = 


( 101 ) 

( 102 ) 

(103) 


(104) 


(105) 


ei(l -I- £)-f u ei(l -I-£)-I-u' 
ei (1 -I- £) -I- u' - [ei (1 -I- £) -I- u] 
[ei(l-f£)-fz;][ei(l-f£)-f u'] 

<(v'-vr/et 


(106) 

(107) 

(108) 


where ||a;||o is defined as the counting norm, i.e., ||a;||o = 
\{xn '■ Xn ^A}\. Applying this result to the objective function 


'^Note that, as e —0, the numerator of the last term in {100\ converges to 
the finite value En-ainT^O while the denominator grows to + 00 . 
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since ei,e > 0 and v^v' > 0. Next, we write as 

—V 

log(a')/ 

1 /a'log(a') — alog(a)\^ 
ef \ a log(a)a'log(a') J 




;log(a) 


where the latter step used a, a' > 1 and 1 + e > 0 and 
v/f-i > 0. Putting together ( 1103b . ( 1108b . (1114b . (1118b . ( 1123b 
(JQ9) and ( 1125b . we see that there exists ^ > 0 such that 

\y 92 {v)-yg 2 {v')\^ <l3{v' -vf VK.u) eC, (126) 

implying that Vp2 is Lipschitz continuous. 


with a = 1 + e + — and a' = 1 + e + —, and realize 

ei ei 
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